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Abstract 

Divergence estimators based on direct approximation of density-ratios without going 
through separate approximation of numerator and denominator densities have been success- 
fully applied to machine learning tasks that involve distribution comparison such as outlier 
detection, transfer learning, and two-sample homogeneity test. However, since density-ratio 
functions often possess high fluctuation, divergence estimation is still a challenging task in 
practice. In this paper, we propose to use relative divergences for distribution compari- 
son, which involves approximation of relative density-ratios. Since relative density-ratios 
are always smoother than corresponding ordinary density-ratios, our proposed method is 
favorable in terms of the non-parametric convergence speed. Furthermore, we show that 
the proposed divergence estimator has asymptotic variance independent of the model com- 
plexity under a parametric setup, implying that the proposed estimator hardly overfits 
even with complex models. Through experiments, we demonstrate the usefulness of the 
proposed approach. 

Keywords: Density ratio, Pearson divergence, Outlier detection, Two-sample homogene- 
ity test, Unconstrained least-squares importance fitting. 



1. Introduction 



Comparing probability distributions is a fun damental task in s t atistical data proc essing. 



It can be used for, e.g. , outlier detection (ISmola et al.1. 120 09; Hid o et al.l . 12011 



sample homogeneity test (jGretton et al- — 
(jShimodairal . bood i lSudvama et all l2007h 



20071 : ISugiyama et alll201ll ). and transfer learning 



two- 
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A standard approach to comparing probability densities p(x) and p'(x) would be to 
estimate a divergence from p(x) to p'(x), such as the Kullback-Leibler (KL) divergence 
(jKullback and Leiblerl . Il95ll ): 



KL[p{x),p'(x)} :-- 



MM) 



p(x)dx. 



A naive way to estimate the KL divergence is to separately approximate the densities p(x) 
and p'{x) from data and plug the estimated densit ies in the abo ve definition. However, 
since density estimation is known to be a hard task ( Vapnik . 19981 ). this approach does not 
work well unless a good parametric model is available. Recently, a divergence estimation 
approach which directly approximates the density ratio, 



r{x) :- 



p'(x) 



without going through separate approx i matio n of densities p(x) and p'(x) has been proposed 
( Sugiyama et al. . 20081 ; Nguyen et all . 120101 ). Such density-ratio approximation methods 
were proved to achieve the optimal non-parametric convergence rate in the mini-max sense. 

However, the KL divergence estimation via density-ratio approximation is computation- 
ally rather expensive due to the non-linearity introduced by the 'log' te rm. To cope w ith 
this problem, another divergence called the Pearson (PE) divergence (Pearson, 1900l ) is 
useful. The PE divergence from p(x) to p'(x) is defined as 



PE[p(x),p'(x) 



P{x) 
p'(x) 



1 ) p'(x)dx. 



The PE divergence is a squared-loss variant of the KL divergence, and they both belong 
to t he class of the Ali-Si l vey-Csis z ar div ergences (which is also known as the f -divergences, 



sec 



Ali and SilvevL Il966l : ICsiszarl . Il967h . Thus, the PE and KL divergences share similar 
properties, e.g., they are non-negative and vanish if and only if p(x) = p'(x). 

Similarly to the KL divergence estimation, the PE divergence can als o be accurately 



estimated based on density-ratio approximation ( Kanamori et al. . 20091 ): the density 



ratio approximator called unconstrained least-squares importance fitting (uLSIF) gives the 
PE divergence estimator analytically, which can be computed just by solving a system 
of linear equations. The practical usefulness of the uLSIF-based PE dive rgence esti- 



mato r was demonstrated in various applications such as outlier detection (jHido et al. 



2011), two-sample homogene ity test ( Sugiyama et al. . 20 111 ), and dimensionality reduction 
([Suzuki and Sugiyama . l2010h . 

In this paper, we first establish the non-parametric convergence rate of the uLSIF-based 
PE divergence estimator, which elucidates its superior theoretical properties. However, it 
also reveals that its convergence rate is actually governed by the 'sup'-norm of the true 
density-ratio function: m&x x r(x). This implies that, in the region where the denominator 
density p'[x) takes small values, the density ratio r(x) = p(x)/p'(x) tends to take large 
values and therefore the overall convergence speed becomes slow. More critically, density 
ratios can even diverge to infinity un der a rather s i mple setting, e.g., when the ratio of 



two Gaussian functions is considered ( Cortes et al. . 2010l ). This makes the paradigm of 



divergence estimation based on density-ratio approximation unreliable. 
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In order to overcome this fundamental problem, we propose an alternative approach to 
distribution comparison called a-relative divergence estimation. In the proposed approach, 
we estimate the quantity called the a-relative divergence, which is the divergence from p(x) 
to the a-mixture density ap(x) + (1 — a)p'(x) for < a < 1. For example, the a-relative 
PE divergence is given by 

PE„[p(x),p'(x)] := PE\p{x),ap(x) + (1 - a)rf(xj\ 

= \ / L W+ t- a v W - 2 W") + <» - d - 

We estimate the a-relative divergence by direct approximation of the a-relative density- 
ratio: 

( \ ._ K^O 

ap(cc) + (1 — a)p'(a:) 

A notable advantage of this approach is that the a-relative density-ratio is always 
bounded above by 1/a when a > 0, even when the ordinary density-ratio is unbounded. 
Based on this feature, we theoretically show that the a-relative PE divergence estima- 
tor based on a-relative density-ratio approximation is more favorable than the ordinary 
density-ratio approach in terms of the non-parametric convergence speed. 

We further prove that, under a correctly-specified parametric setup, the asymptotic 
variance of our a-relative PE divergence estimator does not depend on the model complexity. 
This means that the proposed a-relative PE divergence estimator hardly overfits even with 
complex models. 

Through extensive experiments on outlier detection, two-sample homogeneity test, and 
transfer learning, we demonstrate that our proposed a-relative PE divergence estimator 
compares favorably with alternative approaches. 

The rest of this paper is structured as follows. In Section [2j our proposed relative 
PE divergence estimator is described. In Section [3j we provide non-parametric analysis 
of the convergence rate and parametric analysis of the variance of the proposed PE diver- 
gence estimator. In Section [H we experimentally evaluate the performance of the proposed 
method on various tasks. Finally, in Section [5j we conclude the paper by summarizing our 
contributions and describing future prospects. 

2. Estimation of Relative Pearson Divergence via Least-Squares Relative 
Density-Ratio Approximation 

In this section, we propose an estimator of the relative Pearson (PE) divergence based on 
least-squares relative density-ratio approximation. 

2.1 Problem Formulation 

Suppose we are given independent and identically distributed (i.i.d.) samples {xi}^ =1 from 
a (i-dimensional distribution P with density p(x) and i.i.d. samples {a^}™ =1 from another 



3 



Yamada, Suzuki, Kanamori, Hachiya, and Sugiyama 



(i-dimensional distribution P' with density p'{x): 

{xi\U ~ d ' P, 

{x , } n Li i-td. p, 

The goal of this paper is to compare the two underlying distributions P and P' only using 
the two sets of samples {xi\™ =l and {a^}™ =1 . 

For < a < 1, let q a (x) be the a-mixture density of p(x) and p'{x): 

q a (x) := ap(x) + (1 - a)p'(x). 

Let r a (£c) be the a-relative density-ratio of p(aj) and p'(x): 

r ( X ) .= = ZM. (D 

Ql ; • ap(x) + (1 - a)p> (x) q a (xY { 1 

We define the a-relative PE divergence from p(x) to p'(x) as 

PE Q := ±E qa{x) [(r a (x) - l) 2 ] , (2) 
where M p r x \[f(x)] denotes the expectation of f(x) under p(x): 

K p(x) [f(x)] = J f(x)p(x)dx. 

When a = 0, PEq, is reduced to the ordinary PE divergence. Thus, the a-relative PE 
divergence can be regarded as a 'smoothed' extension of the ordinary PE divergence. 

Below, we give a method for estimating the a-relative PE divergence based on the 
approximation of the a-relative density-ratio. 

2.2 Direct Approximation of a-Relative Density-Ratios 

Here, we describe a method for approximating the a-relative density-ratio ([1]). 
Let us model the a-relative density-ratio r a (x) by the following kernel model: 

n 

g(x;0) := 'Y^9 i K(x,X(), 

e=i 

where := (9i,... ,0 n ) T are parameters to be learned from data samples, T denotes the 
transpose of a matrix or a vector, and K(x, x') is a kernel basis function. In the experiments, 
we use the Gaussian kernel: 

K(x, x') = exp 
where a (> 0) is the kernel width. 
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The parameters in the model g(x;6) are determined so that the following expected 
squared-error J is minimized: 

J{6) :=\^ qa{x) [{g{x-e)-r a {x)f 

= ^E p(x) [g(x;0) 2 ] + £-Z^Ep, (a0 [g(x;0) 2 ] - E p(x) [g(x;0)] + Const., 

where we used r a (x)q a (x) = p(x) in the third term. Approximating the expectations by 
empirical averages, we obtain the following optimization problem: 



6 := argmin 



-0 T H0-h T + -6 T 



(3) 



where a penalty term X0 T 6/2 is included for regularization purposes, and A (> 0) denotes 
the regularization parameter. H is the n x n matrix with the (^,^')-th element 



Hti< := - y^K(xi,Xi)K(xi,xi') H ^■y^K(x' j ,xe)K(x' j ,xe> 



h is the n-dimensional vector with the l-th. element 

1 n 

hi := - K(xi,x e ). 
n 



(4) 



It is easy to confirm that the solution of Eq.@ can be analytically obtained as 

e = {H + \i n )- 1 h, 

where I n denotes the n-dimensional identity matrix. Finally, a density-ratio estimator is 
given as 



r a (x) := g{x;6) = ^20 e K(x,x e ) 



(5) 



When a = 0, the above method is reduced to a d irect density-rat i o esti mator called 
unconstrained least-squares importance fitting (uLSIF; Kanamori et al. . 20091 ). Thus, the 
above method can be regarded as an extension of uLSIF to the a-relative density-ratio. For 
this reason, we refer to our method as relative uLSIF (RuLSIF). 

The performance of RuLSIF depends on the choice of the kernel function (the kernel 
width a in the case of the Gaussian kernel) and the regularization parameter A. Model 
selection of RuLSIF is possible based on cross-valid ation with respe c t to t he squared-error 
criterion J, in the same way as the orig inal uLSIF (jKanamori et all 120091 ) . 

A MATLAB® implementation of RuLSIF is available from 



(made public after acceptance) 
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2.3 a-Relative PE Divergence Estimation Based on RuLSIF 

Using an estimator of the a-relative density-ratio r a (x), we can construct estimators of 
the a-relative PE divergence (J2J. After a few lines of calculation, we can show that the 
a-relative PE divergence ([2]) is equivalently expressed as 

PE a = ™Ep (a0 [r a {x) 2 ] - (1 ~ a) E p , (a;) [r a {xf] + E p(x) [r a (x)] - i 
1 . , 1 

Note that the first li ne can also be obt ained via Legendre-Fenchel convex duality of the 
divergence functional ( Rockafellar . 1970l ). 

Based on these expressions, we consider the following two estimators: 

1=1 j=l 1=1 

~ 1 n 1 

PE„ (7) 

We note that the a-relative PE divergence ([2]) can have further different expressions than 
the above ones, and corresponding estimators can also be constructed similarly. However, 
the above two expressions will be particularly useful: the first estimator PE a has superior 
theoretical properties (see Section [3J and the second one PE a is simple to compute. 



2.4 Illustrative Examples 

Here, we numerically illustrate the behavior of RuLSIF ([5]) using toy datasets. Let the 
numerator distribution be P = iV(0, 1), where N{n,a 2 ) denotes the normal distribution 
with mean \i and variance a 2 . The denominator distribution P' is set as follows: 

(a) P' = N(0, 1): P and P' are the same. 

(b) P' = N(0, 0.6): P' has smaller standard deviation than P. 

(c) P' = N(0, 2): P' has larger standard deviation than P. 

(d) P' = N(0.5, 1): P and P' have different means. 

(e) P' = 0.95iV(0, 1) + 0.05iV(3, 1): P' contains an additional component to P. 

We draw n = n' = 300 samples from the above densities, and compute RuLSIF for a = 0, 
0.5, and 0.95. 

Figure Q] shows the true densities, true density-ratios, and their estimates by RuLSIF. As 
can be seen from the graphs, the profiles of the true a-relative density-ratios get smoother 
as a increases. In particular, in the datasets (b) and (d), the true density-ratios for a = 
diverge to infinity, while those for a = 0.5 and 0.95 are bounded (by 1/a). Overall, as a 
gets large, the estimation quality of RuLSIF tends to be improved since the complexity of 
true density-ratio functions is reduced. 
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(e) P' = 0.95iV(0, 1) + 0.05iV(3, 1): P' contains an additional component to P. 

Figure 1: Illustrative examples of density-ratio approximation by RuLSIF. From left to 
right: true densities (P = N(0, 1)), true density-ratios, and their estimates for 
a = 0, 0.5, and 0.95. 
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Note that, in the dataset (a) where p(x) = p'(x), the true density-ratio r a (x) does not 
depend on a since r a (x) = 1 for any a. However, the estimated density-ratios still depend 
on a through the matrix H (see Eq.(j4])). 

3. Theoretical Analysis 

In this section, we analyze theoretical properties of the proposed PE divergence estima- 
tors. More specifically, we provide non-parametric analysis of the convergence rate in Sec- 
tion 13.11 and parametric analysis of the estimation variance in Section 13.21 Since our 
theoretical analysis is highly technical, we focus on explaining practical insights we can 
gain from the theoretical results here; we describe all the mathematical details of the non- 
parametric convergence-rate analysis in Appendix [2] and the parametric variance analysis 
in Appendix [Bj 

For theoretical analysis, let us consider a rather abstract form of our relative density- 
ratio estimator described as 



argmin 



n (-\ \ n 1 " \ 

i=l j = l i=l 



(8) 



where Q is some function space (i.e., a statistical model) and R{-) is some regularization 
functional. 

3.1 Non-Parametric Convergence Analysis 

First, we elucidate the non-parametric convergence rate of the proposed PE estimators. 
Here, we practically regard the function spa ce Q as an infinite-dimensional reproducing 
kernel Hilbert space (RKHS; lAronszainl . llQfsd ) such as the Gaussian kernel space, and R(-) 
as the associated RKHS norm. 

3.1.1 Theoretical Results 

Let us represent the complexity of the function space Q by 7 (0 < 7 < 2); the larger 7 is, 
the more complex the function class Q is (see Appendix [A] for its precise definition) . We 
analyze the convergence rate of our PE divergence estimators as n := min(n,n') tends to 
infinity for A = A^ under 

Afi->o(l) and Ar 1 = (n 2 /( 2 +T)). 

The first condition means that A^ tends to zero, but the second condition means that its 
shrinking speed should not be too fast. 

Under several technical assumptions detailed in Appendix [Aj we have the following 
asymptotic convergence results for the two PE divergence estimators PE Q © and PE Q ([7]): 

PE a - PE a = OJn-^cWraWoo + Xn max(l, R(r a ) 2 )), (9) 
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and 

PE a -PE Q = O p (\y 2 \\r a \\)l 2 max{l, R(r a )} 

+ A^max{lJ|r Q ||^/ 2 )/ 2 , J R(r a )||r Q ||^/ 2 )/ 2 , J R(r Q )}), (10) 

where O p denotes the asymptotic order in probability, 

c := (1 + a)^Y p{x) [r a (x)} + (1 - a)^Y p , {x) [r a (x)}, (11) 

and Y p ^[f(x)} denotes the variance of f(x) under p(x): 

Y p{x) [f(x)} = J (f(x)- J f(x)p(x)dx\ P (x)dx. 
3.1.2 Interpretation 

In both Eq.Q and Eq.(|10p. the coefficients of the leading terms (i.e., the first terms) of the 
asymptotic convergence rates become smaller as H^Hoo gets smaller. Since 



(a+ (1 - a)/r(x))' 



< j- for a > 0, 



oc 



larger a would be more preferable in terms of the asymptotic approximation error. Note 
that when a = 0, H^Hoo can tend to infinity even under a simple setting that the ratio of 



two Gaussian functions is considered ([Cortes et all |2010| . see also the numerical examples 



in Section [2.41 of this paper). Thus, our proposed approach of estimating the a-relative PE 
divergence (with a > 0) would be more advantageous than the naive approach of estimating 
the plain PE divergence (which corresponds to a = 0) in terms of the non-parametric 
convergence rate. 

The above results also show that PE a and PE a have different asymptotic convergence 
rates. The leading term in Eq.(|9]) is of order n -1 / 2 , while the leading term in Eq. (110j) is of 
order A^ 2 , which is slightly slower (depending on the complexity 7) than n _1//2 . Thus, PE Q 
would be more accurate than PE Q in large sample cases. Furthermore, when p{x) = p'(x), 
^p(x) [ r a(x)] = holds and thus c = holds (see Eq.([TT])). Then the leading term in 

Eq.([9]) vanishes and therefore PE a has the even faster convergence rate of order X n , which 
is slightly slower (depending on the complexity 7) than n _1 . Similarly, if a is close to 1, 
r a (x) « 1 and thus c « holds. 

When n is not large enough to be able to neglect the terms of o(n -1 / 2 ), the terms of 
0(\n) matter. If ||r a ||oo and R(r a ) are large (this can happen, e.g., when a is close to 0), 
the coefficient of the 0(A^)-term in Eq.([9]) can be larger than that in Eq, (|10p . Then PE Q 
would be more favorable than PE a in terms of the approximation accuracy. 

3.1.3 Numerical Illustration 

Let us numerically investigate the above interpretation using the same artificial dataset as 
Section El 
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Figure 2: Illustrative examples of divergence estimation by RuLSIF. From left to right: true 
density-ratios for a = 0, 0.5, and 0.95 (P = N(0, 1)), and estimation error of PE 
divergence for a = 0, 0.5, and 0.95. 
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Figure [2] shows the mean and standard deviation of PE Q and PEq, over 100 runs for 
a = 0, 0.5, and 0.95, as functions of n (= n' in this experiment). The true PE Q (which 
was numerically computed) is also plotted in the graphs. The graphs show that both the 
estimators PE Q and PE a approach the true PE Q as the number of samples increases, and 
the approximation error tends to be smaller if a is larger. 

When a is large, PE Q tends to perform slightly better than PE a . On the other hand, 
when a is small and the number of samples is small, PE a slightly compares favorably with 
PE a . Overall, these numerical results well agree with our theory. 



3.2 Parametric Variance Analysis 

Next, we analyze the asymptotic variance of the PE divergence estimator PE 
parametric setup. 



under a 



3.2.1 Theoretical Results 

As the function space Q in Eq.([8]), we consider the following parametric model: 

G = {g(x-O) | e C K 6 }, 

where b is a finite number. Here we assume that the above parametric model is correctly 
specified, i.e., it includes the true relative density-ratio function r a (x): there exists 6* such 
that 

g(x;0*) = r a (x). 



Here, we use RuLSIF without regularization, i.e., A = in Eq.([8|). 

Let us denote the variance of PE a ([6]) by V[PEJ, where randomness comes from the 
draw of samples {xi}™ =1 and {x'A™ =l . Then, under a standard regularity condition for the 

asymptotic normality ^see Section 3 of Ivan der Vaartl B. V[PE Q ] can be expressed and 
upper-bounded as 



V[PE a ] = -V p(b0 



ar a (xY 



1 



— y p '(x) 
n' v ' 



(1 - a)r a {xf 



< 



a\\oo 

n 
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ce 2 \\r, 11 



■in 



I > 1 1 TO | (1 01 



,2 1 1 ||4 
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1 1 
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n n' 



+ (12) 
v n n' 1 



(13) 



Let us denote the variance of PE a by V[PE a ]. Then, under a stand ard regularity 
condition for the asymptotic normality (see Section 3 of Ivan der Vaartl . lioool ). the variance 
of PE Q is asymptotically expressed as 



V[PE 



-V, 



n 



p(x) 



r a + (1 - ar Q )E p(a;) [V(7] T J7- 1 V(7 



+ 



n' 



(l-a)r Q E p(a;) [V 5 ] T J7- 1 V 5 



+ o 



1 1 



n n 



(14) 
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where Vg is the gradient vector of g with respect to 6 at = 0*, i.e., 

(Vg(x;0 ))j = — — . 

The matrix U a is defined by 

U a = aE p{x) [VgVg T ] + (1 - a)E p , {x) [VgVg T ]. 
3.2.2 Interpretation 

Eq.(|I2J) shows that, up to 0(±, the variance of PE a depends only on the true relative 
density-ratio r a (x), not on the estimator of r a (x). This means that the model complexity 
does not affect the asymptotic variance. Therefore, overfitting would hardly occur in the 
estimation of the relative PE divergence even when complex models are used. We note 
that the above superior property is applicable only to relative PE divergence estimation, 
not to relative density-ratio estimation. This implies that overfitting occurs in relative 
density-ratio estimation, but the approximation error cancels out in relative PE divergence 
estimation. 

On the other hand, Eq. (|14p shows that the variance of PE Q is affected by the model Q, 
since the factor E, p ^[V g] T U^V g depends on the model complexity in general. When the 
equality 

E p{x) [Vg} T U- 1 Vg(x-e*) = r a (x) 

holds, the variances of PE a and PE Q are asymptotically the same. However, in general, the 
use of PE a would be more recommended. 

Eq. (fT3j) shows that the variance V[PE Q ] can be upper-bounded by the quantity depend- 
ing on H^o-Hoo? which is monotonically lowered if ||^ck||oo 

is reduced. Since HtqUoo mono- 
tonically decreases as a increases, our proposed approach of estimating the a-relative PE 
divergence (with a > 0) would be more advantageous than the naive approach of estimating 
the plain PE divergence (which corresponds to a = 0) in terms of the parametric asymptotic 
variance. 



3.2.3 Numerical Illustration 

Here, we show some numerical results for illustrating the above theoretical results using the 
one-dimensional datasets (b) and (c) in Section 12.41 Let us define the parametric model as 



Gk = { g(x;9) 



r(x; 6) 



ar(x; 0) + 1 — a 



r{x 



;0)=exph]^Ue 



(15) 



£=0 



The dimension of the model Qk is equal to k + 1. The a-relative density-ratio r a (x) can be 
expressed using the ordinary density-ratio r(x) = p(x)/p'(x) as 



r a {x) 



r{x) 
ar(x) + 1 — a 
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Figure 3: Standard deviations of PE estimators for dataset (b) (i.e., P = N(0, 1) and 
P' = N(Q, 0.6)) as functions of the sample size n = n'. 
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Figure 4: Standard deviations of PE estimators for dataset (c) (i.e., P = iV(0, 1) and P' = 
iV(0, 2)) as functions of the sample size n = n f . 
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Thus, when k > 1, the above model Qj. includes the true relative density-ratio r a {x) of the 
datasets (b) and (c). We test RuLSIF with a = 0.2 and 0.8 for the model (|15p with degree 
k = 1,2, ... ,8. The parameter 9 is learned so that Eq.(|8]) is minimized by a quasi-Newton 
method. 

The standard deviations of PE a and PE Q for the datasets (b) and (c) are depicted in 
Figure [3] and Figure HJ respectively. The graphs show that the degree of models does not 
significantly affect the standard deviation of PEq, (i.e., no overfitting) , as long as the model 
includes the true relative density-ratio (i.e., k > 1). On the other hand, bigger models tend 
to produce larger standard deviations in PE a . Thus, the standard deviation of PE Q more 
strongly depends on the model complexity. 

4. Experiments 

In this section, we experimentally evaluate the performance of the proposed method in 
two-sample homogeneity test, outlier detection, and transfer learning tasks. 

4.1 Two-Sample Homogeneity Test 

First, we apply the proposed divergence estimator to two-sample homogeneity test. 
4.1.1 Divergence-Based Two-Sample Homogeneity Test 

Given two sets of samples X = {ajj}™ =1 P and X' = {x'A^i P' , the goal of the 
two-sample homogeneity test is to test the null hypothesis that the probability distributions 
P and P' are the same against its complementary alternative (i.e., the distributions are 
different). 

By using an estimator Div of some divergence between the two distributions P and P', 
homogeneity of two d i stribu tions can be tested based on the permutation test procedure 
(jEfron and Tibshirani , 1993) as follows: 



• Obtain a divergence estimate Div using the original datasets X and X'. 

• Randomly permute the \X U X'\ samples, and assign the first \X\ samples to a set X 
and the remaining \X'\ samples to another set X'. 

• Obtain a divergence estimate Div using the randomly shuffled datasets X and X' (note 
that, since X and X' can be regarded as being drawn from the same distribution, Div 
tends to be close to zero). 

• Repeat this random shuffling procedure many times, and construct the empirical 
distribution of Div under the null hypothesis that the two distributions are the same. 

• Approximate the p-value by evaluating the relative ranking of the original Div in the 
distribution of Div. 



When an asymmet ric divergence s uch as the KL divergence (jKullback and Leibleri . fl95lT ) 



or the PE divergence (P earson . 1900) is adopted for two-sample homogeneity test, the test 



results depend on the choice of directions: a divergence from P to P' or from P' to P. 
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Sugivama et al. (mil) proposed to choose the direction that gives a smaller p-value — it 



was experimentally shown that, when the uLSIF-based PE divergence estimator is used for 
the two-sample homogeneity test (which is called the least-squares two-sample homogeneity 
test; LSTT), the heuristic of choosing the direction with a smaller p-value contributes to 
reducing the type-II error (the probability of accepting incorrect null- hypotheses, i.e., two 
distributions are judged to be the same when they are actually different), while the increase 
of the type-I error (the probability of rejecting correct null- hypotheses, i.e., two distributions 
are judged to be different when they are actually the same) is kept moderate. 

Below, we refer to LSTT with p(x)/p'(x) as the plain LSTT, LSTT with p'(x)/p(x) as 
the reciprocal LSTT, and LSTT with heuristically choosing the one with a smaller p-value 
as the adaptive LSTT. 



4.1.2 Artificial Datasets 

We illustrate how the proposed method behaves in two-sample homogeneity test scenarios 
using the artificial datasets (a)-(d) described in Section 12.41 We test the plain LSTT, 
reciprocal LSTT, and adaptive LSTT for a = 0, 0.5, and 0.95, with significance level 5%. 

The experimental results are shown in Figure [5j For the dataset (a) where P = P' (i.e., 
the null hypothesis is correct), the plain LSTT and reciprocal LSTT correctly accept the 
null hypothesis with probability approximately 95%. This means that the type-I error is 
properly controlled in these methods. On the other hand, the adaptive LSTT tends to give 
slightly lower acceptance rates than 95% for this toy dataset, but the adaptive LSTT with 
a = 0.5 still works reasonably well. This implies that the heuristic of choosing the method 
with a smaller p-value does not have critical influence on the type-I error. 

In the datasets (b), (c), and (d), P is different from P' (i.e., the null hypothesis is not 
correct), and thus we want to reduce the acceptance rate of the incorrect null- hypothesis 
as much as possible. In the plain setup for the dataset (b) and the reciprocal setup for the 
dataset (c), the true density-ratio functions with a = diverge to infinity, and thus larger 
a makes the density-ratio approximation more reliable. However, a = 0.95 does not work 
well because it produces an overly-smoothed density-ratio function and thus it is hard to 
be distinguished from the completely constant density-ratio function (which corresponds to 
P = P'). On the other hand, in the reciprocal setup for the dataset (b) and the plain setup 
for the dataset (c), small a performs poorly since density-ratio functions with large a can 
be more accurately approximated than those with small a (see Figure [T]) • In the adaptive 
setup, large a tends to perform slightly better than small a for the datasets (b) and (c). 

In the dataset (d), the true density-ratio function with a = diverges to infinity for 
both the plain and reciprocal setups. In this case, middle a performs the best, which well 
balances the trade-off between high distinguishability from the completely constant density- 
ratio function (which corresponds to P = P') and easy approximability. The same tendency 
that middle a works well can also be mildly observed in the adaptive LSTT for the dataset 
(d). 

Overall, if the plain LSTT (or the reciprocal LSTT) is used, small a (or large a) some- 
times works excellently. However, it performs poorly in other cases and thus the performance 
is unstable depending on the true distributions. The plain LSTT (or the reciprocal LSTT) 
with middle a tends to perform reasonably well for all datasets. On the other hand, the 
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(a) P' = N(0, 1): P and P' are the same. 




(b) P' = iV(0,0.6): P 1 has smaller standard deviation than P. 




(c) P' — N(0,2): P' has larger standard deviation than P. 




(d) P' = N(0.5, 1): P and P' have different means. 



Figure 5: Illustrative examples of two-sample homogeneity test based on relative divergence 
estimation. From left to right: true densities (P = N(0, 1)), the acceptance rate 
of the null hypothesis under the significance level 5% by plain LSTT, reciprocal 
LSTT, and adaptive LSTT. 
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adaptive LSTT was shown to nicely overcome the above instability problem when a is small 
or large. However, when a is set to be a middle value, the plain LSTT and the reciprocal 
LSTT both give similar results and thus the adaptive LSTT provides only a small amount 
of improvement. 

Our empirical finding is that, if we have prior knowledge that one distribution has a 
wider support than the other distribution, assigning the distribution with a wider support 
to P' and setting a to be a large value seem to work well. If there is no knowledge on the 
true distributions or two distributions have less overlapped supports, using middle a in the 
adaptive setup seems to be a reasonable choice. 

We will systematically investigate this issue using more complex datasets below. 



4.1.3 Benchmark Datasets 



Here, we apply the proposed two-sam ple homogene i ty tes t to the binary classification 
datasets taken from the IDA repository ( Ratsch et all l200l"l ) . 



We test the adaptive LSTT with the RuLSIF-based PE diverg ence estimator for a = , 
0.5, and 0.95; we also test the maximum mean discrepancy (MMD; Borgwardt et al. . 20061 ). 



which is a kernel-based two-sample homogeneity test method. The performance of MMD 
de pends on the choice o f the Gaussian kernel width. Here, we adopt a version proposed 
by ISriperumbudur et al. I (|2009h . which automatically optimizes the Gaussian kernel width. 



The p-values of MMD are computed in the same way as LSTT based on the permutation 
test procedure. 

First, we investigate the rate of accepting the null hypothesis when the null hypothesis is 
correct (i.e., the two distributions are the same). We split all the positive training samples 
into two sets and perform two-sample homogeneity test for the two sets of samples. The 
experimental results are summarized in Table dj showing that the adaptive LSTT with 
a = 0.5 compares favorably with that with a = and 1. LSTT with a = 0.5 and MMD 
are comparable to each other in terms of the type-I error. 

Next, we consider the situation where the null hypothesis is not correct (i.e., the two 
distributions are different). The numerator samples are generated in the same way as above, 
but a half of denominator samples are replaced with negative training samples. Thus, 
while the numerator sample set contains only positive training samples, the denominator 
sample set includes both positive and negative training samples. The experimental results 
are summarized in Table [2J showing that the adaptive LSTT with a = 0.5 again compares 
favorably with that with a = and 1. Furthermore, LSTT with a = 0.5 tends to outperform 
MMD in terms of the type-II error. 

Overall, LSTT with a = 0.5 is shown to be a useful method for two-sample homogeneity 
test. 



4.2 Inlier-Based Outlier Detection 

Next, we apply the proposed method to outlier detection. 



4.2.1 Density-Ratio Approach to Inlier-Based Outlier Detection 

Let us consider an outlier detection problem of finding irregular samples in a dataset (called 
an "evaluation dataset") based on another dataset (called a "model dataset") that only 
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Table 1: Experimental results of two-sample homogeneity test for the IDA datasets. The 
mean (and standard deviation in the bracket) rate of accepting the null hypoth- 
esis (i.e., P = P') under the significance level 5% is reported. The two sets of 
samples are both taken from the positive training set (i.e., the null hypothesis is 
correct). The best method having the highest mean acceptance rate and compa- 
rable methods according to the t-test at the significance level 5% are specified by 
bold face. 



Datasets 


d 


n = n' 


MMD 


LSTT 
(a = 0.0) 


LSTT 
(a = 0.5) 


LSTT 

(a = 0.95) 


banana 


2 


100 


0.98(0.14) 


0.93(0.26) 


0.92(0.27) 


0.92(0.27) 


thyroid 


5 


19 


0.98(0.14) 


0.95(0.22) 


0.95(0.22) 


0.88 (0.33) 


titanic 


5 


21 


0.92(0.27) 


0.86(0.35) 


0.92(0.27) 


0.89(0.31) 


diabetes 


8 


85 


0.95(0.22) 


0.87 (0.34) 


0.91(0.29) 


0.82 (0.39) 


breast-cancer 


9 


29 


0.98(0.14) 


0.91 (0.29) 


0.94(0.24) 


0.92(0.27) 


flare-solar 


9 


100 


0.93(0.26) 


0.91(0.29) 


0.95(0.22) 


0.93(0.26) 


heart 


13 


38 


0.96(0.20) 


0.85 (0.36) 


0.91(0.29) 


0.93(0.26) 


german 


20 


100 


0.93(0.26) 


0.91(0.29) 


0.92(0.27) 


0.89(0.31) 


ringnorm 


20 


100 


0.95(0.22) 


0.93(0.26) 


0.91(0.29) 


0.85 (0.36) 


waveform 


21 


66 


0.93(0.26) 


0.92(0.27) 


0.93(0.26) 


0.88(0.33) 



Table 2: Experimental results of two-sample homogeneity test for the IDA datasets. The 
mean (and standard deviation in the bracket) rate of accepting the null hypothesis 
(i.e., P = P') under the significance level 5% is reported. The set of samples 
corresponding to the numerator of the density ratio is taken from the positive 
training set and the set of samples corresponding to the denominator of the density 
ratio is taken from the positive training set and the negative training set (i.e., 
the null hypothesis is not correct). The best method having the lowest mean 
acceptance rate and comparable methods according to the t-test at the significance 
level 5% are specified by bold face. 



Datasets 


d 


n = n' 


MMD 


LSTT 
(a = 0.0) 


LSTT 

(a = 0.5) 


LSTT 
(a = 0.95) 


banana 


2 


100 


0.80 (0.40) 


0.10(0.30) 


0.02(0.14) 


0.17(0.38) 


thyroid 


5 


19 


0.72 (0.45) 


0.81 (0.39) 


0.65(0.48) 


0.80 (0.40) 


titanic 


5 


21 


0.79(0.41) 


0.86(0.35) 


0.87(0.34) 


0.88(0.33) 


diabetes 


8 


85 


0.38(0.49) 


0.42(0.50) 


0.47 (0.50) 


0.57 (0.50) 


breast-cancer 


9 


29 


0.91 (0.29) 


0.75(0.44) 


0.80(0.40) 


0.79(0.41) 


flare-solar 


9 


100 


0.59(0.49) 


0.81 (0.39) 


0.55(0.50) 


0.66(0.48) 


heart 


13 


38 


0.47(0.50) 


0.28(0.45) 


0.40(0.49) 


0.62 (0.49) 


german 


20 


100 


0.59 (0.49) 


0.55 (0.50) 


0.44(0.50) 


0.68 (0.47) 


ringnorm 


20 


100 


0.00(0.00) 


0.00(0.00) 


0.00(0.00) 


0.02(0.14) 


waveform 


21 


66 


0.00(0.00) 


0.00(0.00) 


0.02(0.14) 


0.00(0.00) 
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Table 3: Mean AUC score (and the standard deviation in the bracket) over 1000 trials for 
the artificial outlier-detection dataset. The best method in terms of the mean 
AUC score and comparable methods according to the t-test at the significance 
level 5% are specified by bold face. 



Input 


RuLSIF 


RuLSIF 


RuLSIF 


dimensionality d 


{a = 0) 


(a = 0.5) 


(a = 0.95) 


1 


.933(.089) 


.926(.100) 


.896 (.124) 


5 


.882(.099) 


.891(.091) 


.894(.086) 


10 


.842 (.107) 


.850(.103) 


.859(.092) 



contains regular samples. Defining the density ratio over the two sets of samples, we can 
see that the density-ratio values for regular samples are close to one, while those for outliers 
tend to be significantly deviated fro m one. Thus, density-ratio values co uld be used as an 
index of the degree of outlyingness (|Smola et all 120091 : Iffido et all l201lh . 

Since the evaluation dataset usually has a wider support than the model dataset, we 
regard the evaluation dataset as samples corresponding to the denominator density p'(x), 
and the model dataset as samples corresponding to the numerator density p(x). Then, 
outliers tend to have smaller density-ratio values (i.e., close to zero). As such, density-ratio 
approximators can be used for outlier detection. 

When evaluating the performance of outlier detection methods, it is important to take 
into account both the detection rate (i.e., the amount of true outliers an outlier detection 
algorithm can find) and the detection accuracy (i.e., the amount of true inliers an outlier 
detection algorithm misjudges as outliers). Since there is a trade-off between the detection 
rate and the detectio n accu racy, we adopt the area under the ROC curve (AUC) as our 
error metric ( Bradley . 19971 ). 



4.2.2 Artificial Datasets 

First, we illustrate how the proposed method behaves in outlier detection scenarios using 
artificial datasets. 
Let 

P = N(0,I d ), 

P' = 0.95N(0,I d ) + 0.05A^(3d- 1 / 2 l d ,/ d ), 

where d is the dimensionality of x and 1<j is the d-dimensional vector with all one. Note 
that this setup is the same as the dataset (e) described in Section [2.41 when d = 1. Here, 
the samples drawn from N(0, Id) are regarded as inliers, while the samples drawn from 
N(d~ 1 / 2 ld, Id) are regarded as outliers. We use n = n' = 100 samples. 

Table [3] describes the AUC values for input dimensionality d = 1, 5, and 10 for RuLSIF 
with a = 0, 0.5, and 0.95. This shows that, as the input dimensionality d increases, the 
AUC values overall get smaller. Thus, outlier detection becomes more challenging in high- 
dimensional cases. 
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The result also shows that RuLSIF with small a tends to work well when the input 
dimensionality is low, and RuLSIF with large a works better as the input dimensionality 
increases. This tendency can be interpreted as follows: If a is small, the density-ratio func- 



tion tends to have sharp 'hollow' for outlier points (see the leftmost graph in Figure 2(e) ). 
Thus, as long as the true density-ratio function can be accurately estimated, small a would 
be preferable in outlier detection. When the data dimensionality is low, density-ratio ap- 
proximation is rather easy and thus small a tends to perform well. However, as the data 
dimensionality increases, density-ratio approximation gets harder, and thus large a which 
produces a smoother density-ratio function is more favorable since such a smoother function 
can be more easily approximated than a 'bumpy' one produced by small a. 

4.2.3 Real-World Datasets 

Next, we evaluate the proposed outlier detection method using various real-world datasets: 
IDA repository: The IDA repository ( Ratsch et all 200 ll ) contains various binary classi- 



fication tasks. Each dataset consists of positive/negative and training/test samples. 
We use positive training samples as inkers in the "model" set. In the "evaluation" set, 
we use at most 100 positive test samples as inkers and the first 5% of negative test 
samples as outliers. Thus, the positive samples are treated as inkers and tke negative 
samples are treated as outliers. 

Speech dataset: An in-kouse speeck dataset, wkick contains skort utterance samples 
recorded from 2 male subjects speaking in Frenck witk sampling rate 44.1kHz. From 
eack utterance sample, w e extracted a 50-dimensional line spectral frequencies vector 
( Kain and Macon . 1998). We randomly take 200 samples from one class and assign 



tkem to tke model dataset. Tken we randomly take 200 samples from tke same class 
and 10 samples from tke otker class. 

20 Newsgroup dataset: Tke 20-Newsgroups datase10 contains 20000 newsgroup docu- 
ments, wkick contains tke following 4 top-level categories: 'comp', 'rec', 'sci', and 
'talk'. Eack document is expressed by a 100-dimensional bag-of-words vector of term- 
frequencies. We randomly take 200 samples from tke 'comp' class and assign tkem to 
tke model dataset. Tken we randomly take 200 samples from tke same class and 10 
samples from one of tke otker classes for tke evaluation dataset. 

The USPS hand- written digit dataset: Tke USPS kand- written digit datasetH con- 
tains 9298 digit images. Eack image consists of 256 (= 16 x 16) pixels and eack pixel 
takes an integer value between and 255 as tke intensity level. We regard samples 
in one class as inkers and samples in otker classes as outliers. We randomly take 
200 samples from tke inker class and assign tkem to tke model dataset. Tken we 
randomly take 200 samples from tke same inker class and 10 samples from one of tke 
otker classes for tke evaluation dataset. 

We compare tke AUC scores of RuLSIF witk a = 0, 0.5, and 0.95, and on e-class sup- 
port vector machine (OSVM) witk tke Gaussian kernel ( Sckolkopf et all l200ll ). We used 



1. 
2. 



http: //people . csail .mit . edu/jrennie/20Newsgroups/ 
http: //www . gaussianprocess . org/gpml/data/ 
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the LIBSVM implementation of OSVM (jChang and Linl . l200lh . The Gaussian width is 
set to the median d i stance between samples, which has been shown to be a useful heuris- 
tic (IScholkopf et all l200lh . Since there is no systematic method to determine the tuning 
parameter v in OSVM, we report the results for v = 0.05 and 0.1. 

The mean and standard deviation of the AUC scores over 100 runs with random sample 
choice are summarized in Table [U showing that RuLSIF overall compares favorably with 
OSVM. Among the RuLSIF methods, small a tends to perform well for low-dimensional 
datasets, and large a tends to work well for high-dimensional datasets. This tendency well 
agrees with that for the artificial datasets (see Section [4. 2. 2ft . 



4.3 Transfer Learning 

Finally, we apply the proposed method to outlier detection. 



4.3.1 Transductive Transfer Learning by Importance Sampling 

Let us consider a problem of semi-supervised learning ( Chapelle et al. . 20061 ) from labeled 
training samples {(xj, 2/^ r )}j=i anc ^ unlabeled test samples {a:* 6 }™^. The goal is to predict a 
test output value y te for a test input point x te . Here, we consider the setup where the labeled 
training samples {(^ r >2/j r )}"=i are drawn i.i.d. from p(y\x)pt r (x), while the unlabeled test 
samples {cc* }™^ are drawn i.i.d. from p tc (x), which is generally different from pt T (x); the 
(unknown) test sample (x te , y te ) follows p(y\x)p te (x). This setup means that the conditional 
probability p(y\x) is common to training and test samples, but the marginal densities ptr(%) 
and pte( x ) are generally differe nt for training and tes t input points. Such a problem is called 
trans ductive transfer lea rning (Pan and Yang . 201C), domain adaptat i on (Jiang and Zhail . 
2007 ). or covariate shift ( Shimodaira . 2000l : Sugiyama and Kawanabe . 2011 ). 



Let loss(y, y) be a point-wise loss function that measures a discrepancy between y and 
y (at input x). Then the generalization error which we would like to ultimately minimize 
is defined as 



^ P (y\ x ) Ptc ( x ) [loss (y, f(x))] , 

where f(x) is a function model. Since the generalization error is inaccessible because the 
true pro bability p(y\x )pt e (x) is unknown, empirical-error minimization is often used in 
practice (jVapnikl . Il998l ) : 



mm 



1 



»tr 



J2^ss(yf,f(xf)) 



However, under the covariate shift setup, plain empirical-error minimization is not consistent 
(i.e., it does not converge to the optimal functi on) if the model T is misspecified (i.e., the 



true function is not included in the model; see Shimodaira . 2000l ) . Instead, the following 



importance-weighted empirical-error minimization is consistent under covariate shift: 



mm 



n tr 

E 



r(xJ)los S (yf,f(xf)) 



21 



Yamada, Suzuki, Kanamori, Hachiya, and Sugiyama 



Table 4: Experimental results of outlier detection for various for real-world datasets. Mean 
AUC score (and standard deviation in the bracket) over 100 trials is reported. 
The best method having the highest mean AUC score and comparable methods 
according to the t-test at the significance level 5% are specified by bold face. The 
datasets are sorted in the ascending order of the input dimensionality d. 



i^aiabeis 


a 


OSVM 
(u = 0.05) 


OSVM 
(u = 0.1) 


RuLSIF 

(a = 0) 


RuLSIF 

(a = 0.5) 


RuLSIF 
(a = 0.95) 


IDA:banana 


2 


.668(.105) 


.676(.120) 


.597 (.097) 


.619 (.101) 


.623 (.115) 


IDA:thyroid 


5 


.760 (.148) 


.782(.165) 


.804(.148) 


.796(.178) 


.722 (.153) 


IDA:titanic 


5 


.757(.205) 


.752(.191) 


.750(.182) 


.701 (.184) 


.712 (.185) 


IDA: diabetes 


8 


.636(.099) 


.610 (.090) 


.594 (.105) 


.575 (.105) 


.663(.112) 


IDA:b-cancer 


9 


.741(.160) 


.691 (.147) 


.707(.148) 


.737(.159) 


.733(.160) 


IDA:f-solar 


9 


.594 (.087) 


.590 (.083) 


.626(.102) 


.612(.100) 


.584 (.114) 


IDA:heart 


13 


.714 (.140) 


.694 (.148) 


.748(.149) 


.769(.134) 


.726 (.127) 


IDA:german 


20 


.612(.069) 


.604(.084) 


.605(.092) 


.597(.101) 


.605(.095) 


IDA:ringnorm 


20 


.991(.012) 


.993(.007) 


.944 (.091) 


.971 (.062) 


.992(.010) 


IDA: waveform 


21 


.812 (.107) 


.843 (.123) 


.879(.122) 


.875(.117) 


.885(.102) 


Speech 


50 


.788 (.068) 


.830(.060) 


.804 (.101) 


.821(.076) 


.836(.083) 


20News ('rec') 


100 


.598 (.063) 


.593 (.061) 


.628 (.105) 


.614 (.093) 


.767(.100) 


20News ('sci') 


100 


.592 (.069) 


.589 (.071) 


.620 (.094) 


.609 (.087) 


.704(.093) 


20News ('talk') 


100 


.661 (.084) 


.658 (.084) 


.672 (.117) 


.670 (.102) 


.823(.078) 


USPS (1 vs. 2) 


256 


.889 (.052) 


.926(.037) 


.848 (.081) 


.878 (.088) 


.898 (.051) 


USPS (2 vs. 3) 


256 


.823 (.053) 


.835 (.050) 


.803 (.093) 


.818 (.085) 


.879(.074) 


USPS (3 vs. 4) 


256 


.901 (.044) 


.939 (.031) 


.950 (.056) 


.961 (.041) 


.984(.016) 


USPS (4 vs. 5) 


256 


.871 (.041) 


.890 (.036) 


.857 (.099) 


.874 (.082) 


.941(.031) 


USPS (5 vs. 6) 


256 


.825 (.058) 


.859 (.052) 


.863 (.078) 


.867 (.068) 


.901(.049) 


USPS (6 vs. 7) 


256 


.910 (.034) 


.950 (.025) 


.972 (.038) 


.984 (.018) 


.994(.010) 


USPS (7 vs. 8) 


256 


.938 (.030) 


.967 (.021) 


.941 (.053) 


.951 (.039) 


.980(.015) 


USPS (8 vs. 9) 


256 


.721 (.072) 


.728 (.073) 


.721 (.084) 


.728 (.083) 


.761(.096) 


USPS (9 vs. 0) 


256 


.920 (.037) 


.966 (.023) 


.982 (.048) 


.989 (.022) 


.994(.011) 
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where r(x) is called the importance ( Fishman . 19961 ) in the context of covariate shift adap- 
tation: 

/ n Pte(x) 

However, since importance- weighted learning is not statistically efficient (i.e., it tends 
to have larger variance), sl ightly flattening th e importance weights is practically useful for 
stabilizing the estimator. Ishimodairal ffl) proposed to use the exponentially- flattened 
importance weights as 



mm 



r(xf) T loss(yf ' 



r ,/«)) 



where < r < 1 is called the exponential flattening parameter, r = corresponds to plain 
empirical-error minimization, while r = 1 corresponds to importance-weighted empirical- 
error minimization; < r < 1 will give an intermediate estimator that balances the trade-off 
between statistical efficiency and consistency. The exponential flattening parameter r can be 
optimized by model selection criteria such as th e importance-weighted Akaike information 
criterion for regular models ( Shimodaira . 200(ih. the imvortanc e-weiqhted subspace infor- 
mation criterion for linear models ( Sugivama and Miiller . 20051 ). and importance-weighted 



cross-validation for arbitrary models (jSugiyama et all 120071 ). 



One of the potential drawbacks of the above exponential flattering approach is that 
estimation of r(x) (i.e., r = 1) is rather hard, as shown in this paper. Thus, when r(x) 
is estimated poorly, all flattened weights r(x) T are also unreliable and then covariate shift 
adaptation does not work well in practice. To cope with this problem, we propose to use 
relative importance weights alternatively: 



mm 

fGF 



1 



ntr 



where r a (x) (0 < a < 1) is the a-relative importance weight defined by 

r , x . = Ptejx) 

(l-a)pte(x) + aptr(x)' 

Note that, compared with the definition of the a-relative density-ratio ([1]), a and (1 — a) 
are swapped in order to be consistent with exponential flattening. Indeed, the relative 
importance weights play a similar role to exponentially-flattened importance weights; a = 
corresponds to plain empirical-error minimization, while a = 1 corresponds to importance- 
weighted empirical-error minimization; < a < 1 will give an intermediate estimator 
that balances the trade-off between efficiency and consistency. We note that the relative 
importance weights and exponentially flattened importance weights agree only when a = 
t = and a = r = 1; for < a = r < 1, they are generally different. 

A possible advantage of the above relative importance weights is that its estimation for 
< a < 1 does not depend on that for a = 1, unlike exponentially- flattened importance 
weights. Since a-relative importance weights for < a < 1 can be reliably estimated by 
RuLSIF proposed in this paper, the performance of covariate shift adaptation is expected 
to be improved. Below, we experimentally investigate this effect. 
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4.3.2 Artificial Datasets 



First, we illustrate how the proposed method behaves in covariate shift adaptation using 
one-dimensional artificial datasets. 

In this experiment, we employ the following kernel regression model: 



»tc 



/(z;/3) = J>exp' " 



i=l 



2p 2 



where /3 = (/3i, . . . , /3„ tc ) T is the parameter to be learned and p is the Gaussian width. The 
parameter (3 is learned by relative importance-weighted least-squares (RIW-LS): 



/3riw-ls = argmin 
/3 



— E f -(^) (/(*?;/*) 

n tr z — ' 

or exponentially-flattened importance-weighted least-squares (EIW-LS): 



/^eiw- 



ls 



argmin 
/3 



f^) T (/(xf;/3) 



tr\ 2 

2/i ) 



The relative importance weight r a {xj) is estimated by RuLSIF, and the exponentially- 
flattened importance weight r(x* r ) r is estimated by uLSIF (i.e., RuLSIF with a = 1). T he 



Gaus sian width p is chosen by 5-fold importance-weighted cross-validation (jSugiyama et al 
20071 1 . 

First, we consider the case where input distributions do not change: 

Ptr = Pte = iV(l,0.25). 



The densities and their ratios are plotted in Figure 6(a) The training output samples 
{vf }?=i are generated as 



y 



tr 



sine (x 



J) 



+ e 



tr 



where {ef} T ^ 1 is additive noise following iV(0,0.01). We set n tr = 100 and n te = 200. 
Figure 6(b) | shows a realization of training and test samples as well as learned functions 
obtained by RIW-LS with a = 0.5 and EIW-LS with r = 0.5. This shows that RIW-LS 
with a = 0.5 and EIW-LS with r = 0.5 give almost the same functions, and both functions 



fit the true function well in the test region. Figure 6(c) shows the mean and standard 



deviation of the test error under the squared loss over 200 runs, as functions of the relative 
flattening parameter a in RIW-LS and the exponential flattening parameter r in EIW- 
LS. The method having a lower mean test error and another method that is comparable 
according to the t-test at the significance level 5% are specified by 'o'. As can be observed, 
the proposed RIW-LS compares favorably with EIW-LS. 



Next, we consider the situation where input distribution changes (Figure 7(a) ): 



P tr = JV(1,0.25), 
P te = N(2,0.1). 
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2 \—Ptr(x) 
■■■Pte(x) 

r.fcr) 

1.5^— (n(x))°- 

■r u . 5 {x) 
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0.5) 


--RIW-LS (a-- 
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LS 
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Figure 6: Illustrative example of transfer learning under no distribution change. 



2\—Ptr{ x ) 
■■■Pte(x) 

n (x) 

1.5 ---(n(s)) a5 

\--ro.s(x) 

1 
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a; 

(a) Densities and ratios 



o Training 




x Test 




— TRUE 




-■■EIW-LS (t = 


0.5) 


RIW-LS (a 


= 0.5) 




-10 12 3 

x 

(b) Learned functions 



0.1 



£0.05 



--EIW- 


LS 


— RIW- 


-LS 



0.5 

r — a 



(c) Test error 



Figure 7: Illustrative example of transfer learning under covariate shift. 



The output values are created in the same way as the previous case. Figure |7(b)| shows a 
realization of training and test samples as well as learned functions obtained by RIW-LS 
with a = 0.5 and EIW-LS with r = 0.5. This shows that RIW-LS with a = 0.5 fits the true 
function slightly better than EIW-LS with r = 0.5 in the test region. Figure 7(c) shows that 
the proposed RIW-LS tends to outperform EIW-LS, and the standard deviation of the test 
error for RIW-LS is much smaller than EIW-LS. This is because EIW-LS with < r < 1 is 
based on an importance estimate with r = 1, which tends to have high fluctuation. Overall, 
the stabilization effect of relative importance estimation was shown to improve the test 
accuracy. 



4.3.3 Real- World Datasets 

Finally, we evaluate the proposed transfer learning method on a real- world transfer learning 
task. 

We consider the problem of human activity recognition from accelerometer data col- 
lected by iPod toucHo. In the data collection procedure, subjects were asked to perform a 
specific action such as walking, running, and bicycle riding. The duration of each task was 
arbitrary and the sampling rate was 20Hz with small variations. An example of three-axis 
accelerometer data for "walking" is plotted in Figure EJ 

3. http://alkan.mns.kyutech.ac . jp/web/data.html 
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Figure 8: An example of three-axis accelerometer data for "walking" collected by iPod 
touch. 



To extract features from the accelerometer data, each data stream was segmented in a 
sliding window manner with window width 5 seconds and sliding step 1 second. Depending 
on subjects, the position and orientation of iPod touch was arbitrary — held by hand or kept 
in a pocket or a bag. For this reason, we decided to take the i^-norm of the 3-dimensional 
acceleration vector at each time step, and computed the following 5 orientation-invariant 
features from each window: mean, st andard deviation , fluc t uation of amplitude, a verage 



energy, and frequency- domain entropy (|Bao and Intilld . 120041 ; iBharatula et all 120051 ) 



Let us consider a situation where a new user wants to use the activity recognition 
system. However, since the new user is not willing to label his/her accelerometer data 
due to troublesomeness, no labeled sample is available for the new user. On the other 
hand, unlabeled samples for the new user and labeled data obtained from existing users 
are available. Let labeled training data {{xj , yj r )}j=i be the set of labeled accelerometer 
data for 20 existing users. Each user has at most 100 labeled samples for each action. Let 
unlabeled test data {a;* 6 }™^ be unlabeled accelerometer data obtained from the new user. 

We use kernel logistic regression (KLR) for activity recognition. We compare the fol- 
lowing four methods: 

• Plain KLR without importance weights (i.e., a = or r = 0). 

• KLR with relative importance weights for a = 0.5. 

• KLR with exponentially-flattened importance weights for r = 0.5. 

• KLR with plain importance weights (i.e., a = 1 or r = 1). 

The experiments are repeated 100 times with different sample choice for nt r = 500 and 
rite = 200. Table [5] depicts the classification accuracy for three binary-classification tasks: 
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Table 5: Experimental results of transfer learning in real-world human activity recognition. 

Mean classification accuracy (and the standard deviation in the bracket) over 100 
runs for activity recognition of a new user is reported. The method having the 
lowest mean classification accuracy and comparable methods according to the t- 
test at the significance level 5% are specified by bold face. 



Task 


KLR 

(a = 0, t = 0) 


RIW-KLR 

(a = 0.5) 


EIW-KLR 
(r = 0.5) 


IW-KLR 

(a = 1, r = 1) 


Walks vs. run 


0.803 (0.082) 


0.889(0.035) 


0.882(0.039) 


0.882(0.035) 


Walks vs. bicycle 


0.880 (0.025) 


0.892(0.035) 


0.867 (0.054) 


0.854 (0.070) 


Walks vs. train 


0.985 (0.017) 


0.992(0.008) 


0.989 (0.011) 


0.983 (0.021) 



walk vs. run, walk vs. riding a bicycle, and walk vs. taking a train. The classification accuracy 
is evaluated for 800 samples from the new user that are not used for classifier training (i.e., 
the 800 test samples are different from 200 unlabeled samples). The table shows that KLR 
with relative importance weights for a = 0.5 compares favorably with other methods in 
terms of the classification accuracy. KLR with plain importance weights and KLR with 
exponentially-flattened importance weights for r = 0.5 are outperformed by KLR without 
importance weights in the walk vs. riding a bicycle task due to the instability of importance 
weight estimation for a = 1 or r = 1. 

Overall, the proposed relative density-ratio estimation method was shown to be useful 
also in transfer learning under covariate shift. 



5. Conclusion 



In this paper, we proposed to use a relative divergence for robust distribution compari- 
son. We gave a computationally efficient method for estimating the relative Pearson di- 
vergence based on direct relative density-ratio approximation. We theoretically elucidated 
the convergence rate of the proposed divergence estimator under non-parametric setup, 
which showed that the proposed approach of estimating the relative Pearson divergence 
is more preferable than the existing approach of estimating the plain Pearson divergence. 
Furthermore, we proved that the asymptotic variance of the proposed divergence estima- 
tor is independent of the model complexity under a correctly-specified parametric setup. 
Thus, the proposed divergence estimator hardly overfits even with complex models. Exper- 
imentally, we demonstrated the practical usefulness of the proposed divergence estimator 
in two-sample homogeneity test, inlier-based outlier detection, and transductive transfer 
learning under covariate shift. 

In addition to two-sample homogeneity test, outlier detection, and transfer learn- 
ing, density ratios were shown to b e useful for tackling various machine learning prob- 



lems, inchidingjiiulti^ (jBickel et all 120081: ISimm et a J. l201lh . independence 



test ( Sueiva ma and Suzuki l201lh , feature selection (jSuzuki et all 120091). causal inference 



20ld ). i ndependent componen t analy sis ([Suzuki and Sugivamal . 



(Ya mada and Sugivama 

2011 ), dimensionality reducti on (jSuzuki and Sugivamal. l2010f). unpair ed data matching 



([Yamada and Sugivama . clustering (jKimura and Sugivamal . |201l|) , conditional den 
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sity estimation ( Sugiyama et al. . 2010I ). and probabilistic classification ( Sugiyama . 201Cll ). 



Thus, it would be promising to explore more applications of the proposed relative density- 
ratio approximator beyond two-sample homogeneity test, outlier detection, and transfer 
learning tasks. 
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Appendix A. Technical Details of Non-Parametric Convergence Analysis 

Here, we give the technical details of the non-parametric convergence analysis described in 
Section 13.11 

A.l Results 

For notational simplicity, we define linear operators P,P n ,P',P' n , as 



/'/: !>/• /',/ : 



n 



P'f:=E q f, P' n ,f:= !2=kM> 



For a G [0, 1], we define <SVi,n' and S as 

S n ,n' = aP n + (1 - a)P' n ,, S = aP + (1 - a)P'. 

We estimate the Pearson divergence between p and ap + (1 — a)q through estimating the 
density ratio 

* •= p 

ap + (1 — a)p' 

Let us consider the following density ratio estimator: 



g := argmm 



l - {aP n + (1 - a)P^) g 2 - P n g + ^-R{gf 



= argmin (\s n , nl g 2 - P n g + ^R(g) 2 
geg V 2 1 

where n = min(n, n') and R{g) is a non-negative regularization functional such that 

sup[|s(aO|] < R(g). (16) 
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A possible estimator of the Pearson (PE) divergence PE Q is 



Another possibility is 



PE a :— P n g — -S n ^ n ig 2 — -. 



PE Q ;= X -P n g- X -. 



A useful example is to use a reproducing kernel Hilbert space (RKHS; lAronszainl . Il950h 



as Q and the RKHS norm as R(g). Suppose Q is an RKHS associated with bounded kernel 
fe( v ): 

sup[fc(£c, x)] < C. 

x 

Let || • \\g denote the norm in the RKHS Q. Then R{g) = ^/~C\\g\\g satisfies Eq. (fT6|) : 

g{x) = (k(x,-),g(-)) < y/k(x, x)\\g\\g < VC\\g\\g, 

where we used the reproducing property of the kernel and Schwartz's inequality. Note that 
the Gaussian kernel satisfies this with C = 1. It is known that the Gaussian kernel RKHS 
spans a dense subset in the set of continuous functions. Another example of RKHSs is 
Sobolev space. The canonical norm for this space is the integral of the squared derivatives 
of functions. Thus the regularization term R(g) = \\g\\g imposes the solution to be smooth. 
The R KHS technique in Sobolev space has been well exploited in the context of spline 
models (jWahbal . Effloh . We intend that the regularization term R(g) is a generalization of 



the RKHS norm. Roughly speaking, R{g) is like a "norm" of the function space Q. 

We assume that the true density-ratio function g*(x) is contained in the model Q and 
is bounded from above: 

g*(x) < M for all x £ V X - 
Let Qm be a ball of Q with radius M > 0: 

Qm :={geg\ R(g) < M}. 

To derive the convergence rate of our estimator, we utilize the bracketing entropy tha t is a 
complexity measure of a function class (see p. 83 of van der Vaart and Wellner . 19961 ). 



Definition 1 Given two functions I and u, the bracket [I, u] is the set of all functions f 
with l(x) < f{x) < u(x) for all x. An e-bracket with respect to Lt2{p) is a bracket [Z, it] with 
\\l — u||x, 2 (p) < e. The bracketing entropy T-ln(T, e, ^(p)) is the logarithm of the minimum 
number of e-brackets with respect to ^(p) needed to cover a function set J-. 

We assume that there exists 7 (0 < 7 < 2) such that, for all M > 0, 
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This quantity represents a complexity of function class Q — the larger 7 is, the more com- 
plex the function class Q is because, for larger 7, more brackets are needed to cover 
the function class. The Ga ussian RKHS satisfies this condition for arbitrarily small 7 
( Steinwart and Scovel . 20071 ). Note that when R(g) is the RKHS norm, the condition (|17p 
holds for all M > if that holds for M = 1. 
Then we have the following theorem. 



Theorem 2 Let n = min(n,n'), M = \\g*\\oo, and c = (1 + a)^P(g* - Pg*) 2 + (1 - 
a)\JP'(g* — P'g*) 2 . Under the above setting, if Xn — > and A^ 1 = o(n 2// ( 2+7 )), then we 
have 

PE a - PE a = O p {X n max(l, R{g*) 2 ) + n- 1/2 cM ), 

and 

PE a - PE a =O p (A s max{l,M| (1 " i) ,ii( 5 *)M ( f (1 " i) ,^( 5 *)} + A| max{M* ,M*R(jg*)}), 
where O p denotes the asymptotic order in probability. 

In the proof of Theorem [21 we use the following auxiliary lemma. 
Lemma 3 Under the setting of Theorem^ i/A s — > and A^ 1 = o(n 2 /( 2+7 )), then we have 

\\9-9*\\l 2 (S) = O p {\)! 2 max{l, R(g*)}), R(g) = O p (max{l,R(g*)}), 
where \\ ■ \\l 2 (S) denotes the L2{ap + (1 — a)q)-norm. 

A. 2 Proof of Lemma [3] 

First, we prove Lemma [3l 

From the definition, we obtain 

^S n yg 2 - P n g + X n R(g) 2 < \s n , n ,g* 2 - P n g* + X n R(g*) 2 
=> \Sn,w{g - g*) 2 - S n>n ,(g*(g* - g)) - P n (g - g*) + X n (R(g) 2 - R(g*f) < 0. 
On the other hand, S(g*(g* — g)) = P(g* — g) indicates 

l(S- S n;n ,)(g - g*) 2 - (S - S n , n ,)(g*(g* - g)) - (P - P n ){g - g*) - X n (R(g) 2 - R{g*f) 
1 



>^S{g-g*) 2 . 

Therefore, to bound \\g — g*\\L 2 rs), it suffices to bound the left-hand side of the above 
inequality. 

Define Fm and as 

?m ■= {g - g* I g e Qm} and T 2 M ■- {f 2 \ f e T M }. 
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To bound \(S — S n y)(g — g*) 2 \, we need to bound the bracketing entropies of T^. We show 
that 

« D (^,^ 2 W) = o((^±^ 2N7 
« n (^,^)) = o" (M + M ° )2x, 



This can be shown as follows. Let fi and fu be a <5-bracket for Qm with respect to L%(p); 
fb{x) < fu(x) and — fu\\L 2 {p) — Without loss of generality, we can assume that 
WfL\\ Loo , \\fu\\ Loo <M + M . Then f v and f' L defined as 

fij(x) := max{fl(x),f$(x)}, 

mm{fl(x), fu(x)} (sign(/ L (x)) = sign(/ a (x))), 
(otherwise) 



< 



< 



are also a bracket such that f' L < g 2 < f{j for all g 6 £7jv/ s.t. /z, < <7 < /[/ an d — 
/^■||x, 2 ( p ) < 25(M + Mo) because — /c/|Il 2 (p) < S and the following relation is met: 

U'l{x) - fvix)f < " f2u{x)? (sign(/L(x)) = si ^fu{x))), 

~ 1 max{/*(x), fu( x )} (otherwise) 

\f L {x) - fu{x)f{f L {x) + fu{x)) 2 (sign(/ L (x)) = sign^x))), 
max{/^(x), fij{x)} (otherwise) 

\f L {x) - fu(x)) 2 (f L (x) + fu(x)) 2 (sign(/ L (x)) = sign(/ c/ (x))), 
- Mz)) 2 (l/4*)l + \fu{x)\) 2 (otherwise) 
<4(fL(x)-fu(x)) 2 (M + M ) 2 . 

Therefore the condition for the bracketing entropies (fT7|) gives T-Lr\ (J 72 ^ 5, L 2 (p)) = 

0(( (A/+ f° )2 ) 7 ). We can also show that n {] (T 2 M ,5,L 2 (q)) = O (( (M+ f o)2 ) 7 ) in the 
same fashion. 

Let f -=g — g* ■ Then, as in Lemma 5.14 and Theorem 10.6 in van de Geer ( 200d ). we 
obtain 

\(S n ,n> ~ S)(f 2 )\ < a\(P n - P)(f 2 )\ + (1 - a)\{P' n , - P'){f 2 )\ 

=aO p (^\\f\^ p) {l + R(gf + M 2 )'i V n^(l + R{g) 2 + M 2 )) 

+ (1 - a)O p (-L||/2||^J, )( i + E(?) 2 + M 2)i v n -^ (1 + fl(5) a + M 2)^ 
<O p (4=11/211^(1 + E( ? )2 + M 2)| v fT*fc(l + R(g) 2 + M 2 )) , (18) 



n 



where a V 6 = max(a, 6) and we used 



«H/"ll2 2 5,) + (l-«)ll/"ll2 2 fp < ( / / 4 d(aP + (l-a)P') ~ =||/ 2 



l(l-i) 



I 1 2 
\L 2 (S) 



31 



Yamada, Suzuki, Kanamori, Hachiya, and Sugiyama 



by Jensen's inequality for a concave function. Since 



II/ 2 IIl 2( 5) < ll/IU 2( 5) V 2 (! + R (a) 2 + M 2 ), 

the right-hand side of Eq. (ll8jl is further bounded by 

\(S n , n > - S)(f)\ 

=o p + R(a? + v n-«T=r(i + R(gf + m 2 )) . (19) 

Similarly, we can show that 

\(S n ,n>-S)(g*(g*-g))\ 

=O p fl=\\f0 s) (l + R(g)M + M 2 )i V n"^(l + R(g)M + M 2 )) , (20) 



and 



\{P„ - PH?j - fi)\ o„ ( -4 ll/lli 2 (|) (i + m + M )i v fr*k(i + + Mo; 



< O p (-Lll/H^yi + R(g) + M )iM| (1 V fi-a*F(l + + Af„)) , (21) 
where we used 

II/IU 2 (P) = J I PdP = J J f 2 g*dS < m| J J pdS 

in the last inequality. Combining Eqs. ([T9|) . (pOj) . and (f2Tjh we can bound the ^(.S^-norm 
of / as 



l 2 ( S ) + KR(g) 

< X n R(g*) 2 + O p (-^H/ll^l) (1 + R(gf + M 2 )t+i V rT^(l + i?(?) 2 + M 2 )) . (22) 
The following is similar to the argument in Theorem 10.6 in Ivan de Geer (2000), but we 



give a simpler proof. 

By Young's inequality, we have a2~4&2 + 4 < (I — ^)a+ (| + < a + 6 for all a, 6 > 0. 
Applying this relation to Eq. (l22p . we obtain 

^il/ilL(s) + ^(?) 2 

< XnR(g*f + O p (ll/llg^ {n-^(l + i?(?) 2 + M 2 )} l+1 V rT^(l + R(gf + M 2 ; 

1 / 2 2 



< A^T + ^11/111,(5) + Op [n-—(l + R(gy + M 2 ) + fT^(l + it>(?) 2 + M 2 
A s fi( 9 *) 2 + J||/||| 2(5) + O v (V^(l + i?(?) 2 + M 2 



^o 2 ) J , 
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which indicates 

\\\f\\l 2i s) + ^R(g) 2 < X n R{g*) 2 + o p (X n (l + R(g) 2 + M 2 )) . 
Therefore, by moving o p (XnR(g) 2 ) to the left hind side, we obtain 

\\\f\\Us) + An(l - o p {l))R{gf < O p + R{g*f + M 2 )) 

<O p (X n (l + R(g*) 2 ))- 

This gives 

||/IU 2 (S)=Op(A|ma X {l, J R( 5 *)}), 

R(g) = O p (Vl + R(g*) 2 ) = O p (max{l,R(jg*)}). 

Consequently, the proof of Lemma [3] was completed. 

A. 3 Proof of Theorem M 

Based on Lemma O we prove Theorem [2j 

AsintheproofofLemmaEl let / :=g-g*. Since (aP+(l-a)P')(fg*) = S{fg*) = Pf, 
we have 

PE a - PE a = ^S n yg 2 - P n g - (^Sg* 2 - Pg*) 

= \snAf + g*) 2 - Pn(f + g*) - Qs<f 2 - p 9 *^ 

= \sf + \{S n ,n> ~ S)f 2 + - S)(g*f) - (P n - P)f 

+ \(Sn,n> - S)g* 2 -{P n g* -Pg*)- (23) 

Below, we show that each term of the right-hand side of the above equation is O p (Xn). By 
the central limit theorem, we have 

\(S n ^-S)g* 2 -{P n g* -Pg*) 

= O p (n-V 2 Mo ((1 + a)y/P(g* - Pg*) 2 + (1 - a)y/P'(g* - P'<?*) 2 )) • 
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Since Lemma [3] gives ||/||2 = Cp(A| max(l, R(g*))) and R(g) = p (max(l, R(g*))), 
Eqs. (fl~9j) . ([20]) . and ([21~]) in the proof of Lemma [3] imply 

I - S)f\ = O p + R{9*)) 1+ * V n-^R{g*f 

< O p (X n m^(l,R(g*) 2 )), 
\{Sn,n> - S)(g*f)\ = O p (-^11/11^(1 + R(g)M + M 2 )^ V rT^(l + R(g)M + M 2 ) 

< O p (X n max(l, R(g*)M* , M^R{g*) l ~i , M i?( 5 *), M 2 )) 

< p (A s max(l,P(/)M|,M o P(<f))), 

< p (A s max(l,i?( 5 *) 2 )), 

\{P n - P)f\ < O p (-^11/11^(1 + R(g) + Mo)*M* V n"^(l + + Mo) 

= O p (A fi max(l,M| (1 " i) ,ii(( ? *)A4 (1 " i) ,i?( 5 *))) (24) 
<O p (A fi max(l,i?( 5 *) 2 )), 

where we used A^ 1 = o(n 2 ^ 2+7 ^) and Mo < R(g*)- Lemma[3]also implies 

Sf = H/H2 = P (A s max(l, J R( 5 *) 2 )). 

Combining these inequalities with Eq. ()23p implies 

PE a - PE a = O p {\ n max(l, R{g*) 2 ) + n'^cMo), 

where we again used Md < R(g*). 
On the other hand, we have 

PE a -PE a = ±P n g-^Pg* 

= 1 [(P n - P )(g-g*) + P(g - g*) + (P n - P)g*} . (25) 



Eq.flMJ) gives 

(P„ - P)(? - g*) = O p (X n max(l, M ^ (1 ~ 5 \ P(£*)M ^ (1 ~*\ #(#*))). 
We also have 

P(?-<7*) < \\9-9*\\l 2 (P) < II? " 9*\\i*(S)M$ =O p (\Zmax(M*,M*R(g*))), 

and 

(P n - P)g* = O p (n-*y/P(g*-Pg*)Z) < o p (fT5Mo) < 0,(4 max(M ^ M| P( 5 *))), 
Therefore by substituting these bounds into the relation (|25p . one observes that 
PE Q — PE Q 

=0 p (Almax(M| ! M o ^(s*)) + A fi m (26) 
This completes the proof. ■ 
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Appendix B. Technical Details of Parametric Variance Analysis 

Here, we give the technical details of the parametric variance analysis described in Sec- 
tion! 



B.l Results 

For the estimation of the a-relative density-ratio (pQ), the statistical model 

G = {g(x;0) | G G G C M b } 

is used where b is a finite number. Let us consider the following estimator of a-relative 
density-ratio, 



9 = ar, 



npc 2 I n L — ' n J n * — ' 



Suppose that the model is correctly specified, i.e., there exists 9* such that 

g(x;6*) = r a (x). 

Then, under a mild assumption (see Theorem 5.23 of van der Vaart . 2000l ). the estimator g 



is consistent and the estimated parameter 6 satisfies the asymptotic normality in the large 
sample limit. Then, a possible estimator of the a-relative Pearson divergence PE a is 

n , n n' n 

PE Q = -J2g( Xi ) --{- £(fe)) 2 + E(?K)) 2 } - ~ 

n 2 n n J I 2 

i=l v i=i j=i 

Note that there are other possible estimators for PE Q such as 

~ 1 n 1 
PE a = -^g( Xl )--. 

i=l 

We study the asymptotic properties of PE Q . The expectation under the probability p 
(p 1 ) is denoted as Ep(a,)[-] (E p /(a.) [•]). Likewise, the variance is denoted as V p (a.)[-] (V p /(a.)[-]). 
Then, we have the following theorem. 

Theorem 4 Let ||t||oo be the sup-norm of the standard density ratio r(x), and H^Hoo be 
the sup-norm of the a-relative density ratio, i.e., 



a r 



+ 1 - a 



The variance of PE Q is denoted as V[PE a ]. Then, under the regularity condition for the 
asymptotic normality, we have the following upper bound o/V[PE Q ]: 



V[PE a ] = -Y p(x) 



ar- r 



IV 



pi {at) 



1 - a)ri 



< 



' a Woo 

n 



+ 



a 2 \\r, 11 1 



An 



"II xi I 0- a 



alloo 



An' 



+ o 



1 1 

+ ol -, - 

• n n 



1 1 



n n 
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Theorem 5 The variance o/PE a is denoted as V[PE a ]. Let Vg be the gradient vector of 
g with respect to 6 at 6 = 6*, i.e., (X7g(x;0*))j = dg ^/* ^ ■ The matrix U a is defined by 

U a = aE p{x) [VgVg T ] + (1 - a)E p , (x) [VgVg T ]. 
Then, under the regularity condition, the variance o/PE a is asymptotically given as 



V[PE n 



n 



p(x) 



r a + (1 - ar Q )E p(a;) [V(7] T ?7- 1 V(7 



1 

n' 



(l-a)r Q E p(a;) [V#£/- 1 V< ? 



+ o 



1 1 



n n' 



B.2 Proof of Theorem |U 

Let be the estimated parameter, i.e., g(x) = g(x;6). Suppose that r a (x) = g{x;6*) € G 
holds. Let 56 = — 6* , then the asymptotic expansion of PE a is given as 

n , n n' 

pE a = 1 £ fa* 5 ) ~ I - E fa^f + ^ £ M -Co? - \ 

n 2 n ^-^ n ^-^ J 2 

8=1 k 1=1 j=l J 

I n \ n 

= PE Q + - J> a (^) " E p(-) M) + - £ V ^ 6 *^ 59 

II i=l 71 i=l 

- 2l I E( r «(^) 2 - e p(x)[^]) + EM*;) 2 - By W [rS]) | 

*> i=l j'=l ' 

n 1 — "1 T / 1 1 \ 

- £ r Q (^)V 5 (^; + £ ^K)VffK-; 0*) | 50 + oj — , — J . 

i=i j=i 



Let us define the linear operator G as 



n 

* i=l 



Likewise, the operator G' is defined for the samples from p'. Then, we have 
PE a — PEq, 



n 



+ ( E p(x)[V3] - aE p(3 ,)[r a Vc/] - (1 - a)E p / (;c) [r a V5]} T <50 + a 



1 1 



a 9 
■r z 

n 2 vn 



G(r a - %rl) 



1 , / 1 — a 



G" 



I) + o P 



1 1 



The second equality follows from 

E P ( x )[Vg] ~ a~E p{x) [r a X7g} - (1 - a)E p , (a ,) [r a Vfif] = 0. 
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Then, the asymptotic variance is given as 



1, 



V[PE Q ] = — V, 



n 



p(as) 



a 2 

r a ~ ~2 r a 



IV 



1 — a 9 

r 

2 a 



1 1 
+ o[ -, - 

v ra ra' 



(27) 



We confirm that both r a — ^|r 2 and "^^r 2 are non-negative and increasing functions with 
respect to r for any a £ [0, 1]. Since the result is trivial for a = 1, we suppose < a < 1. 
The function r a — 77 r 2 is represented as 

a 2 r(ar + 2 — 2a) 
r ° ~ 2 r ° = 2(ar + l-a) 2 ' 

and thus, we have r a — ^r 2 = for r = 0. In addition, the derivative is equal to 

d r(ar + 2 - 2a) (1 - a) 2 



<9r 2(ar + 1 — a) 2 (ar + 1 — a) 3 ' 

which is positive for r > and a £ [0, 1). Hence, the function r a — ^r 2 is non- negative and 
increasing with respect to r. Following the same line, we see that ^-^r 2 is non-negative 
and increasing with respect to r. Thus, we have the following inequalities, 



/ \ ^* / \ 9 ii i 

< r a (x) - -r a (x) < \\r a \ 



a II II 2 



< 



1 — a 



\2 ^ ~~ a ii i|2 



2 " v ' - 2 
As a result, upper bounds of the variances in Eq. (|27p are given as 

2 



V. 



p(x) 



V p'(a;) 



a n 

1 — a n 



5: ( 1 1 r « 1 1 oo 2 1 1 ra 1 1 oo ; • 



< 



1-a) 



r II 4 
alloo- 



Therefore, the following inequality holds, 



< 







1 1 


r a 


ra \ 






2 


\\ r a\ 


oo 


n 



a r, 



> 2 n 2 i MM fi i 

n n' 



alloo 



+ 



a r, 



alloo 



4n 



+ 



(l-a) 2 ||r, 



4 

alloo 



4n' 



+ o 



1 1 



> / I > 



n n 



which completes the proof. 
B.3 Proof of Theorem [5] 

The estimator 6 is the optimal solution of the following problem: 



mm 

eee 



1 n 1 n i n 

- y: ^) 2 + ^ £(i - ) 2 - - E » to «) 



8=1 
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Then, the extremal condition yields the equation, 

n _. n' -\ 11 

^ g(x i ;d)Vg(x t ;d) + 9(^0)Vg(x' j; d) - - £ Vg( Xl ; 9) = 0. 

i=i j=i «=i 

Let 50 = 6 — 6* . The asymptotic expansion of the above equation around 9 = 9* leads to 

\ n 1 n ' / 1 1 \ 

- V(ar Q (^) - l)V^(x i; 0*) + V raix'^Vgix'j- 9*) + U a 59 + o p (—, — ) = 0. 

Therefore, we obtain 

59 = -Lg((1 - arJU^Vg) - 4=G'((1 - 0)^17^ g) + o p f^=, -±=\ . 
V n vn' yV n vn'/ 

Next, we compute the asymptotic expansion of PE a : 
1 1 n 

PE a = -E p{x) [r a ] + — ^2(r a ( Xi ) - E p{x) [r a }) 



i=l 



1 11 1 / 1 1 \ 

1 = 1 



Substituting <50 into the above expansion, we have 

1 



PE Q — PE a 



^=G(r a + (1 - ar a )E p(x) [Vg] 1 ^V^) 
" ^=G'((1 - a)r a E p{x) [V g VU-JVg) + 0p (J=,-^. 



As a result, we have 



V[PEJ = -V. 



n 



p(x) 



r Q + (l-ar Q )E p(a;) [V 5 ] T t/- 1 V 5 



(l-a)r a E p(x) [Vg] l U- L Vg 







+ o(-, 


-) 




n'J 



TV 

which completes the proof. 
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